1. Identity statement | |
Reference Type | Conference Paper (Conference Proceedings) |
Site | sibgrapi.sid.inpe.br |
Holder Code | ibi 8JMKD3MGPEW34M/46T9EHH |
Identifier | 8JMKD3MGPAW/3RP2P48 |
Repository | sid.inpe.br/sibgrapi/2018/09.02.02.43 |
Last Update | 2018:09.02.11.29.09 (UTC) administrator |
Metadata Repository | sid.inpe.br/sibgrapi/2018/09.02.02.43.26 |
Metadata Last Update | 2022:06.14.00.09.19 (UTC) administrator |
DOI | 10.1109/SIBGRAPI.2018.00061 |
Citation Key | MaiaJulcHira:2018:MaLeAp |
Title | A Machine Learning approach for Graph-based Page Segmentation |
Format | On-line |
Year | 2018 |
Access Date | 2024, Apr. 27 |
Number of Files | 1 |
Size | 3626 KiB |
|
2. Context | |
Author | 1 Maia, Ana Lucia Lima Marreiros; 2 Julca-Aguilar, Frank Dennis; 3 Hirata, Nina Sumiko Tomita |
Affiliation | 1 University of São Paulo/State University of Feira de Santana; 2 University of São Paulo; 3 University of São Paulo |
Editor | Ross, Arun; Gastal, Eduardo S. L.; Jorge, Joaquim A.; Queiroz, Ricardo L. de; Minetto, Rodrigo; Sarkar, Sudeep; Papa, João Paulo; Oliveira, Manuel M.; Arbeláez, Pablo; Mery, Domingo; Oliveira, Maria Cristina Ferreira de; Spina, Thiago Vallin; Mendes, Caroline Mazetto; Costa, Henrique Sérgio Gutierrez; Mejail, Marta Estela; Geus, Klaus de; Scheer, Sergio |
e-Mail Address | anamaia@ime.usp.br |
Conference Name | Conference on Graphics, Patterns and Images, 31 (SIBGRAPI) |
Conference Location | Foz do Iguaçu, PR, Brazil |
Date | 29 Oct.-1 Nov. 2018 |
Publisher | IEEE Computer Society |
Publisher City | Los Alamitos |
Book Title | Proceedings |
Tertiary Type | Full Paper |
History (UTC) | 2018-09-02 11:29:09 :: anamaia@ime.usp.br -> administrator :: 2018; 2022-06-14 00:09:19 :: administrator -> :: 2018 |
|
3. Content and structure | |
Is the master or a copy? | is the master |
Content Stage | completed |
Transferable | 1 |
Version Type | finaldraft |
Keywords | Page segmentation document image machine learning graph connected components classification convolutional neural network |
Abstract | We propose a new approach for segmenting a document image into its page components (e.g., text, graphics and tables). Our approach consists of two main steps. In the first step, a set of scores corresponding to the output of a convolutional neural network, one for each of the possible page component categories, is assigned to each connected component in the document. The labeled connected components define a fuzzy over-segmentation of the page. In the second step, spatially close connected components that are likely to belong to the same page component are grouped together. This is done by building an attributed region adjacency graph of the connected components and modeling the problem as an edge removal problem. Edges are then kept or removed based on a pre-trained classifier. The resulting groups, defined by the connected subgraphs, correspond to the detected page components. We evaluate our method on the ICDAR2009 dataset. Results show that our method effectively segments pages and is able to detect the nine types of page components. Furthermore, as our approach is based on simple machine learning models and graph-based techniques, it should be easy to adapt to the segmentation of a variety of document types. |
Arrangement 1 | urlib.net > SDLA > Fonds > SIBGRAPI 2018 > A Machine Learning... |
Arrangement 2 | urlib.net > SDLA > Fonds > Full Index > A Machine Learning... |
doc Directory Content | access |
source Directory Content | FInal_PaperID_50.pdf | 01/09/2018 23:43 | 3.5 MiB | |
agreement Directory Content | |
|
4. Conditions of access and use | |
data URL | http://urlib.net/ibi/8JMKD3MGPAW/3RP2P48 |
zipped data URL | http://urlib.net/zip/8JMKD3MGPAW/3RP2P48 |
Language | en |
Target File | Final_PaperID_50.pdf |
User Group | anamaia@ime.usp.br |
Visibility | shown |
Update Permission | not transferred |
|
5. Allied materials | |
Mirror Repository | sid.inpe.br/banon/2001/03.30.15.38.24 |
Next Higher Units | 8JMKD3MGPAW/3RPADUS 8JMKD3MGPEW34M/4742MCS |
Citing Item List | sid.inpe.br/sibgrapi/2018/09.03.20.37 9 |
Host Collection | sid.inpe.br/banon/2001/03.30.15.38 |
|
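The second step described in the abstract (keeping or removing edges of a region adjacency graph and reading page components off the connected subgraphs) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `group_components` and `same_top_class`, the toy scores, and the edge list are all hypothetical, and the rule "keep an edge when both endpoints share the same top-scoring class" is only a stand-in for the pre-trained edge classifier mentioned in the abstract.

```python
from collections import defaultdict


def group_components(num_nodes, edges, keep_edge):
    """Group nodes into connected subgraphs after dropping rejected edges.

    num_nodes: number of connected components (graph nodes).
    edges: iterable of (u, v) pairs from the adjacency graph.
    keep_edge: callable (u, v) -> bool; hypothetical stand-in for a
               pre-trained edge classifier.
    """
    parent = list(range(num_nodes))  # union-find forest

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for u, v in edges:
        if keep_edge(u, v):  # rejected edges are simply ignored
            root_u, root_v = find(u), find(v)
            if root_u != root_v:
                parent[root_v] = root_u

    groups = defaultdict(list)
    for node in range(num_nodes):
        groups[find(node)].append(node)
    return sorted(groups.values())


# Toy per-component class scores (assumed values, standing in for CNN output).
scores = [
    {"text": 0.9, "graphic": 0.1},
    {"text": 0.8, "graphic": 0.2},
    {"text": 0.2, "graphic": 0.8},
    {"text": 0.3, "graphic": 0.7},
    {"text": 0.7, "graphic": 0.3},
]

# Toy adjacency: each component is linked to its spatial neighbour.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]


def same_top_class(u, v):
    # Assumed rule: keep an edge only if both ends have the same best class.
    top_u = max(scores[u], key=scores[u].get)
    top_v = max(scores[v], key=scores[v].get)
    return top_u == top_v


page_components = group_components(len(scores), edges, same_top_class)
print(page_components)
```

In this toy input, the edges between components of different top classes are dropped, so each remaining connected subgraph plays the role of one detected page component.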